Good afternoon, and thank you for joining us.
We are excited to be here taking part in DEF CON China 1.0.
This talk is IPv666, Address of the Beast.
I am Mark, and this is Chris.
Hey, what's going on, everybody?
My name is Chris Grayson.
I'm originally from Atlanta, Georgia.
I went to Georgia Tech a few times.
I used to be the head of the Georgia Tech Hacking Club.
I've been the head of the red team at Snapchat,
a research scientist,
and presently I'm a security engineer at Bird Rides.
And my name is Mark Newlin.
I am a self-taught hacker.
My security background is largely in the wireless space.
I've done a lot of work doing reverse engineering
using software-defined radios,
and at this point I've discovered and published
wirelessly exploitable vulnerabilities in devices
from somewhere north of 25 vendors.
In my free time, I like to compete in DARPA challenges.
I've placed top three in multiple of them.
I am a prior member of the red team at Snapchat,
and I currently work as a security engineer at Bird Rides.
And so what are we going to be talking about today?
And the answer, in short, is the future.
And if you didn't know this,
Chris and I recently took a trip to the future,
and what we found there is IPv6-connected devices
as far as the eye can see.
Everywhere.
And so we decided to build out an open-source tool set
for discovering these IPv6-connected devices
on the open Internet.
And we've been working on this project
for about a year and a half now,
and it's been quite the learning curve for us.
We've made a lot of mistakes.
We've learned a lot of lessons
and really grown our skill set and knowledge in the space.
And so this talk is a story of how we got
from where we started a year and a half ago
to where we are now with this pretty neat open-source tool.
And this is a fairly technical talk,
and so in the interest of translation clarity and time,
we are glossing over some of the deep technical specifics.
All this data is still in the slides,
and if there are any questions you have about our talk afterward,
we are happy to answer any questions
and fill in any details.
And I also want to point out that neither Chris nor I
are network engineers,
and we've been learning about IPv6
through the course of this research.
And we probably got stuff wrong.
If we did, please let us know.
We are here to learn just as much as we are
to share our experiences.
So we're going to start with some background on IPv6
and our motivations for doing this research.
We're then going to look at the difficulties
of scanning the IPv6 address space.
We're going to look at some techniques we've used
with varying degrees of success
for discovering and predicting new IPv6-connected addresses.
We are going to look at the latest iterations
for our prediction algorithm.
We're then going to look at the latest iterations
of our software where we've implemented functionality
to actually crowdsource a public global data set
of IPv6 addresses.
We'll take a look at the results of our scanning
and some port scan results.
We'll then look at the tool we built,
which is IPv666, and then we'll have our conclusion
and hopefully a few minutes for Q&A.
So for a bit of background, this here is a chart
of the percentage of users who have connected
to Google using IPv6.
And this is over a time span of 10 years,
so on the far left, we have 2009.
On the far right, we have 2019.
And we can see that a decade ago,
nearly nobody was connected to Google over IPv6.
This has grown over time, and now we're approaching
one third of Google traffic happening over IPv6.
And this is just a single internet company,
but it's representative of the fact
that IPv6 is growing in adoption,
and as security practitioners, we are really needing
to spend more time learning about the security of IPv6.
So we got interested in this research
as an offshoot of some other work we did
a couple of years ago.
We discovered a number, in fact, 26 distinct vulnerabilities
in Comcast set-top boxes and cable modems
with our friend Logan Lamb, and we presented this research
at DEF CON 25.
And one of the big takeaways is the most severe
vulnerabilities we found could only be exploited
over IPv6, and to exploit them, we had to have knowledge
of the target device's IPv6 address.
And this was the first time that we had this hint
that, hey, there's something interesting about IPv6
and there's security value in being able to discover
these IPv6 addresses.
And I'll illustrate this with one of our favorite
vulnerabilities, which is a Send2TV exploit.
So if you are a customer of Comcast,
you have a web admin portal on your modem,
which you can access on your local network.
There is a separate web admin portal,
which Comcast can access over the internet,
but through a specific IPv6 address on the modem,
and only if the request is coming from a protected
Comcast network segment.
And in addition to looking at modems,
we looked at the Comcast set-top boxes,
and there is a service called Xfinity Send2TV.
And this service allows you to sign into the Comcast website,
put in a URL, and have this URL loaded
in a web browser on your TV.
What we discovered is that these set-top boxes actually
exist in the same protected Comcast network segment.
So this meant that we could take the IPv6 address
of any customer's modem, put it into the service,
and load up their web admin panel for their modem on our TV
and sign in with hard-coded credentials.
So we could actually remotely administer and configure
any customer's modem, provided that we
knew the IPv6 address of their modem.
And this got us really intrigued with this problem
of doing IPv6 address discovery and just learning more
about IPv6 security in general.
And so we do this initial research.
It's a lot of fun.
Our research is up on GitHub, if anybody
would like to take a look.
But we start thinking, like, huh, IPv6 addresses,
they seem to be pretty integral to these vulnerabilities
that we're finding.
Let's dig in a little bit more.
So we did some cursory research around IPv6.
And a number of things kind of caught our eye
as things that could impact security posture over IPv6
versus IPv4.
The first of these is something called
SLAAC, which is the stateless address autoconfiguration
protocol.
In IPv4, typically you need a DHCP server or a dynamic host
configuration protocol server to give you
an IP address that then enables you to talk over layer 3.
This is a bit of a problem, a bit of an architectural mishap
from the beginning of the internet.
And so we kind of got to this point
that we shouldn't really have something like that.
So in IPv6, you are able to actually provision an IP
address for yourself.
So in order for you to have an IP address
that you can start communicating with,
you no longer have a dependency on a DHCP server,
just that your networking equipment and host operating
system can support it.
And so that's kind of further complicated
by the fact that all modern operating systems
and modern networking equipment both support and prefer IPv6.
So we're talking
Linux, OS X, Windows, and then pretty much
any modern consumer gateway or customer premises equipment
is going to support IPv6 out of the box.
And you'll find in a few slides when
we're talking more about it that, really, if your device
has the ability to establish an IPv6 connection
to a remote host, it's going to prefer that over IPv4.
Another big change is that there's no such thing
as NAT by default in IPv6.
So, you know,
you go home.
You connect to your home Wi-Fi.
And the main reason that nobody is
able to route traffic to your device at that point
is because you're behind NAT.
You have a 10.* address or a 192.168.* address.
You're in this private network space.
So in IPv6, this is no longer the default.
There's still NAT by definition, but it's
no longer what you're going to get out of the box.
And it's really kind of discouraged to be used.
And so when we were first starting this research,
Mark was sitting at home at his apartment.
I was sitting at home at mine.
And he was able to ping the Chromecast on my home network
from his apartment.
And that was kind of surprising to me.
So another thing, let's say that you have thought
about IPv6 before.
And you have a host that you want to make sure
is sufficiently protected.
And so you're like, I want to make sure
that everybody trying to talk to me is firewalled off.
And you say, OK, let me look at my iptables rules,
iptables -L. Sure enough, your INPUT
chain is -j REJECT.
Everything is just getting dropped.
Well, it turns out that iptables has nothing
to do with IPv6.
iptables is just IPv4, whereas ip6tables is the utility
that you would want to use to configure your IPv6 firewall rules.
So it kind of leaves us scratching our heads thinking,
like, how many people think that they have sufficiently
protected their hosts when, in fact, they don't actually
have any IPv6 firewall rules?
So as somebody that has a
fairly significant red teaming background,
my experience with ICMP, the Internet Control Message
Protocol, is like, it's used for ping scanning.
That's the main thing I'm using it for.
We've got traceroute, whatever.
But for something that's called the Internet Control Message
Protocol, it seems like it's not all that important.
In IPv6, that's no longer the case.
So in IPv4, you have ARP, the address resolution protocol.
It's how you map layer two addresses
to layer three addresses.
It enables you to go up the stack in the OSI model
and actually start speaking over
your network.
In IPv6, that is replaced with the neighbor discovery
protocol that does the exact same thing.
But the neighbor discovery protocol is actually a subset
of ICMPv6.
So whereas your networks might work fine
if you block ICMP everywhere, or you just shut off ICMP,
with ICMPv6, that's not going to be the case,
because you're required to allow it in order for you
to actually go up the OSI model.
And then lastly, broadcast is no longer
a thing in IPv6.
So in IPv4, it's used for a number of different protocols.
And in IPv4, there's also this thing known as multicast,
where you can basically send packets
to an address that defines the distance
that you want the message to be propagated
and the hosts that you would like to receive it.
But in IPv4, this wasn't widely supported or used
or implemented.
So it really wasn't something that you saw anywhere.
Well, in IPv6, out of the box, it is.
Multicast support is required to be to spec.
So your networking equipment and your host operating system
is going to allow you to use multicast
and take a single packet and route it to, potentially,
a large number of hosts.
And so we read up on all of these things,
and we're like, OK, this sounds like it might be problematic.
We should probably look into it.
And then we immediately run into the problem
of address discovery, which is the substance
of the rest of this talk.
And so as Chris explained, in order
to test our hypotheses that the security posture of IPv6
is potentially worse than the security posture of IPv4,
we need to discover a corpus of addresses to test this against.
And we quickly run into this problem of scale.
With IPv4, we have a 32-bit address space,
which gives us just shy of 4.3 billion addresses.
And this is a lot of addresses, but it's perfectly conceivable
for a single host to do a port scan across one TCP port
on the entire IPv4 address space in a matter of hours.
So while it's a large number of addresses,
we can just brute force the space and do our research
that way.
With IPv6, we grow to a 128-bit address space.
And this number on the bottom of the slide
is the total number of possible IPv6 addresses.
And I have no idea how to actually pronounce
this number.
It has 12 commas, and it is far too many addresses for us
to ever actually brute force.
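To put the two address spaces side by side, here is a small arithmetic sketch. The probe rate of one million packets per second is an assumption picked only for illustration; it is not a figure from the talk.

```python
# Compare the size of the IPv4 and IPv6 address spaces and estimate how long
# a single exhaustive sweep would take at an assumed probe rate.
IPV4_BITS = 32
IPV6_BITS = 128

ipv4_total = 2 ** IPV4_BITS
ipv6_total = 2 ** IPV6_BITS

# Hypothetical probe rate, chosen only for illustration.
PROBES_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 3600 * 24 * 365


def seconds_to_scan(total_addresses, rate=PROBES_PER_SECOND):
    """Time in seconds to send one probe to every address at `rate`."""
    return total_addresses / rate


print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:,}")
print(f"IPv4 sweep: {seconds_to_scan(ipv4_total) / 3600:.1f} hours")
print(f"IPv6 sweep: {seconds_to_scan(ipv6_total) / SECONDS_PER_YEAR:.2e} years")
```

At that assumed rate the IPv4 sweep finishes in a little over an hour, while the IPv6 sweep is many orders of magnitude beyond any practical timescale.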
And so to try and address this, we
have to look at this problem with pSLAAC.
So pSLAAC is a privacy extension of the stateless address
autoconfiguration protocol.
So SLAAC
is an algorithm which takes in the MAC address
of your network interface, as well as the network bits
of the network you're connected to,
and generates an IPv6 address for you.
So with an IPv6 address, we can think of it
as structured in two parts.
We have the network bits, which are commonly the upper 64 bits
of the address.
And this represents the network that your device
is connected through.
Then we have the host bits, which
are commonly the lower 64 bits of the address, which
represent the device identifier of the device that you're
connecting from.
So SLAAC is functionally
a transform between the MAC address of your network
interface and these host bits.
And this represents a potential privacy concern,
because even if you are connected
to multiple networks, such as first at home, then at work,
then a coffee shop, the network bits are going to change.
But the host bits, because they're based on your MAC
address, are going to remain the same.
So if you visit a website from multiple different locations
with a SLAAC-generated IPv6 address,
that website can determine that it
was the same device that connected from these multiple
locations.
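The MAC-to-host-bits transform can be sketched as follows, assuming the common modified EUI-64 scheme (flip the universal/local bit of the MAC and insert ff:fe in the middle). The MAC address and the 2001:db8 documentation prefixes are made-up sample values, and the function names are hypothetical.

```python
import ipaddress


def eui64_interface_id(mac):
    """Derive the 64 host bits from a 48-bit MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC like 00:11:22:33:44:55")
    octets[0] ^= 0x02  # flip the universal/local bit
    eui64 = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return int.from_bytes(eui64, "big")


def slaac_address(prefix, mac):
    """Combine a /64 network prefix with the MAC-derived host bits."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    return ipaddress.IPv6Address(
        int(network.network_address) | eui64_interface_id(mac)
    )


mac = "00:11:22:33:44:55"                      # sample MAC, not a real device
home = slaac_address("2001:db8:1::/64", mac)   # documentation prefix
work = slaac_address("2001:db8:2::/64", mac)   # documentation prefix
print(home)
print(work)
```

The two printed addresses differ in their upper 64 bits but share the same lower 64 bits, which is the tracking concern described above.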
So pSLAAC introduces pseudo-random entropy
into the SLAAC algorithm and allows
you to have a distinct set of host bits for each connection
that you make.
And this is great from a privacy standpoint,
because you can no longer track a device in the same way.
But it's bad from the address discovery and modeling
standpoint, because with this high amount of entropy,
we don't have any clear ways to model the address generation.
And so we can actually break this problem down into two parts.
We have first the problem of identifying the high entropy
pSLAAC hosts, and then the other problem
of identifying the low entropy non-pSLAAC hosts.
And so to address this, we are trying
to do some honeypotting to identify these pSLAAC hosts.
And the concept is that instead of trying
to predict these addresses and make outgoing connections
to them, we set up web services and network services
and use some tricks to get these devices to connect to us.
So because the search space is too big, again,
we need to have these devices connect to us.
We set up a server with a DNS server, SMTP server, and a web server.
And we use some different techniques
to drive traffic to these network services.
So first, we set up a honeypot BIND DNS server.
And when we first did this, we didn't really
understand IPv6 and all its minutiae.
And we thought that we would need to funnel traffic that
was connecting over IPv4 to IPv6.
So we did some fancy stuff with the DNS records.
And in hindsight, this actually was not necessary at all,
because what we learned is that on modern operating systems,
they will actually prefer IPv6 when it's available.
So in the end, we just ended up
with everybody who could connect over IPv6
connecting directly over IPv6.
So here we have a plot of the number of DNS requests
we've received over time on this honeypot DNS server.
And we can see we have a number of spikes here.
And these spikes are where we paid for traffic from ad
campaigns from an ad network called PopAds.
And PopAds is a service where you give them money,
and then they drive traffic to your website.
And we selected PopAds because they
were the cheapest possible way to get this amount of traffic.
And this is a plot of two minutes of traffic
from PopAds.
And we gave PopAds $10 or $15 in this case,
and received between 40,000 and 50,000 requests
in a matter of two minutes.
And a kind of interesting note about this,
we don't know exactly where this traffic comes from,
but one of the common referring pages we saw
was just a blank page with an anchor tag to our website
and a crypto miner JavaScript payload.
Perfectly legitimate.
That's totally legit.
You know, how many websites do you commonly visit
where it's just a link on a page
and then a cryptocurrency miner?
That's most of the sites that I visit.
Yeah, I mean, pretty much every site I ever go to.
Yeah, yeah.
Yeah.
And so we also set up a honeypot web server.
And this was initially running at IPv6.exposed.
This URL is now where we have our IPv6 search portal,
which you can check out.
So for the honeypot web server, we have this serving
over both IPv4 and IPv6.
The idea was that we would have images on the site serving
over IPv6 so that if users connected over IPv4,
we could still potentially get IPv6 requests
from these images.
And we also set up a WebRTC JavaScript payload,
which would enumerate the private IPv4 addresses
of the client, as well as their IPv6 addresses,
and then post those back to a data store we controlled.
And to try and generate traffic for this, we used PopAds.
And we also posted links all over social media.
We thought, hey, we're cool people.
People will click on these links,
but we're not popular at all, it turns out.
And so we got almost no traffic
from the organic social media efforts.
And here we have a plot of the number of requests over time
on the top for the HTTP requests
and on the bottom for the WebRTC postbacks.
And what we can see is on the top,
this is on the scale of tens of thousands,
but on the bottom, the postbacks,
this is on the scale of thousands.
And this tells us that most of the users,
or rather most of the clients,
who requested this website did not actually execute
this JavaScript payload.
So while we had some apparent actual browsers,
a lot of these may have been automated scripts
from the PopAds network.
And this is over the course of a 10-month period.
And we can see we have these spikes
where we paid for the PopAds campaigns,
but very, very little residual organic traffic
from the social media efforts.
So after setting up the DNS server and the web server,
we set up a honeypot SMTP server.
And we thought we were being clever here.
We thought we could post email addresses
all over the Internet, get DNS and SMTP requests that way,
sign up for spam email lists,
get DNS and SMTP requests that way.
And in the end, we only got a small number of requests
from a small number of hosts.
They were primarily infrastructure email servers
from places like Hotmail and Yahoo.
So the SMTP honeypot was not very good
in terms of results.
So after the 10-month honeypotting experiment,
we collected 92,000 addresses.
We ended up spending close to $1,000 in costs.
And this was good in that we had non-zero addresses,
but it was expensive, took a long time,
and we weren't super thrilled with the results.
So we have this set, and we decided to do
an ICMP ping scan across these 92,000 addresses
to see after 10 months how many of them are still alive.
And it turns out that almost none of them were.
And this was another one of our good learning experiences
when we discovered this thing called
ephemeral IPv6 addresses.
And just as we have ephemeral UDP and TCP ports
for outgoing connections,
we have ephemeral IPv6 addresses for outgoing connections.
And it turns out that most of these 92,000 addresses
were actually connection-specific
ephemeral addresses, which means they were used
this one time and were not persistent,
so we couldn't actually connect back to them.
And this means we ended up spending 10 months
and $1,000 and had basically nothing
but a painful learning experience to show for it.
And so you know that your research project
is going really well when you spend $1,000
and 10 months of effort and have nothing to show for it.
I highly recommend it. You should try it.
So we do all of this attempting to find addresses
that have all of this entropy in them.
We fail miserably at it.
So then we turn our attention to,
okay, what about those addresses
that seem to have more structure in them?
And we kind of came to that conclusion
by looking at these IP addresses.
And I mean, if you take a big list,
there's plenty of public data sets
that have lots and lots of IPv6 addresses in them.
And so we pulled a bunch of them down,
put all the addresses together,
and we were looking through these files.
It looks like there's a lot of structure in these addresses.
It's as if, you know, you have, like,
iterating on the host bits,
so maybe that's like DHCP leases.
There's various byte boundaries
that tend to have, like, common iterations for networks.
But it feels like
there's a lot of structure in these addresses.
So, of course,
as folks that like computers
and like data analysis,
how are we going to solve this problem
of figuring out how to predict
new IPv6 addresses?
Well, obviously, machine learning is the answer.
Machine learning is the answer to everything.
So, as Mark said before,
we are not network engineers.
We're barely engineers at all.
Yeah.
We're not even engineers.
We're not network engineers,
and we're definitely not machine learning experts.
But we have a friend who is a machine learning expert,
and we got him involved,
and he was like,
oh, I think I know what you should use.
It's an autoencoder.
And so I'm going to describe that in the way
that it was described to me.
So, if you think about the way the human eye works,
where light bounces around,
it bounces off of objects,
and then it enters your lens,
and then hits the back of your eye and your retina,
and then your brain is able to interpret that as sight,
well, what happens
when the lens is partly damaged?
You don't have perfect vision,
maybe things get blurry,
but the point being that the input data,
the light waves,
once they pass through the lens,
become distorted,
and it kind of alters the data that you're actually receiving.
So an autoencoder is like an imperfect lens.
Basically, you're going to create a lens
that has a bit of error in it
so that when you pass data through this lens,
it messes that data up.
It transforms that data into new data.
But the interesting thing about this is
that the way in which
it manipulates the data
is representative of the structure
of the data that the lens was trained on.
So the idea would be
that we create a lens,
and we create this lens
from the IPv6 addresses that we have,
we make sure that there's an error threshold
in this lens,
and then we take IPv6 addresses
and pass them through the lens,
the data that we get out the other side
is a transformation of our input data
that is transformed in a way
that is hopefully going to guess
our IP addresses.
And so,
because we're so good at machine learning,
this worked completely perfectly
in that
it didn't work at all.
And so we could not
make anything but a perfect lens.
So basically all of the data
that we put into the autoencoder
was just spit right back out at us.
So our input data was our output data
and it really didn't do much for us.
So, you know,
a thousand dollars down,
machine learning is not working,
we're doing great so far.
So then we found this paper
which gave us a little bit of hope,
which is the Entropy/IP paper,
from these folks at Akamai
who looked at an absolutely massive
data set of IPv6 addresses
and did an analysis
of the structure of these addresses.
And their conclusion was that
yes indeed, there is a lot of structure
in these addresses.
So if you look in the top left hand corner,
kind of that blue wavy line,
is the entropy of addresses
on a per bit basis.
So you'll see kind of on the right of that line
you have very high entropy,
so that would be at the end of the address.
So it's going to be your host bits.
You have another spike of really high entropy
right in the middle.
That's probably going to be your /64 network boundary
because that's where so many networks are allocated.
But the rest of this seems to have
a fairly low amount of entropy.
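A rough way to reproduce this kind of measurement on your own address list is sketched below. It computes Shannon entropy per nibble position rather than per bit, which is a simplification made here, and the sample addresses are made-up values from the 2001:db8 documentation prefix. It is not the paper's implementation.

```python
import ipaddress
import math
from collections import Counter


def nibble_string(addr):
    """Expand an IPv6 address into a 32-character hex string."""
    return ipaddress.IPv6Address(addr).exploded.replace(":", "")


def positional_entropy(addresses):
    """Shannon entropy (in bits) of the nibble value at each of 32 positions."""
    rows = [nibble_string(a) for a in addresses]
    total = len(rows)
    entropies = []
    for pos in range(32):
        counts = Counter(row[pos] for row in rows)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


# Made-up sample input: four hosts that differ only in the last nibble.
sample = ["2001:db8::1", "2001:db8::2", "2001:db8::3", "2001:db8::4"]
for pos, h in enumerate(positional_entropy(sample)):
    if h > 0:
        print(f"position {pos}: {h:.2f} bits")
```

Positions where every address agrees score zero, and a position whose value is uniformly spread over all 16 nibbles would score the maximum of 4 bits.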
So, you know, this was kind of a consolation prize
considering how much we had already done
and not gotten any progress with.
But at least we thought,
okay, we're on the right track.
So we did what we do best,
which is we got really dumb about it.
So we made our own model
for generating IPv6 addresses.
We just needed something that would work,
really anything that would work,
so that we could create this sort of thing
that would generate addresses, scan for them,
and then find live addresses and feed them back into the loop.
So this is what we would do.
We would take an IPv6 address
and we would break it down
into the 32 nibbles that it was made up of
and then we would count
the number of occurrences
of every nibble
based on the preceding nibble.
So in this case, we would say,
okay, in position 0,
we have a current value of 0x2
and we've seen a next value of 0x8 one more time.
And then in position 1,
we have a current value of 0x8
and a next value of 0x0 one more time.
And the next one, in position 2,
0x0, 0x0, one more time.
And we did this with every single IP address
that we have in our input data set
and what we end up with
is basically a probability distribution
where I can say, okay,
in position 1, when I have a nibble value
of 0xa, here is
the likelihood of what the next nibble
is going to be.
And this actually kind of worked.
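The counting step described above might look something like this minimal sketch. The names `to_nibbles` and `build_model` are hypothetical, and the sample addresses are made-up values from the 2001:db8 documentation prefix.

```python
import ipaddress
from collections import Counter, defaultdict


def to_nibbles(addr):
    """Expand an IPv6 address into its 32 hex nibbles as integers."""
    exploded = ipaddress.IPv6Address(addr).exploded.replace(":", "")
    return [int(ch, 16) for ch in exploded]


def build_model(addresses):
    """model[pos][current][nxt] = times `nxt` followed `current` at `pos`."""
    model = [defaultdict(Counter) for _ in range(31)]
    for addr in addresses:
        nibbles = to_nibbles(addr)
        for pos in range(31):
            model[pos][nibbles[pos]][nibbles[pos + 1]] += 1
    return model


# Made-up sample input.
sample = ["2001:db8::1", "2001:db8::2", "2001:db8:0:1::a"]
model = build_model(sample)

# What followed a 0x2 in position 0, and how often?
print(dict(model[0][0x2]))
# What followed a 0x0 in position 30?
print(dict(model[30][0x0]))
```

Dividing each count by the total for its (position, current value) pair gives the probability distribution for the next nibble.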
So once we have this,
we say, okay, how do we turn this
from a model into creating
an IP address?
We say, okay, we're going to start with an 0x2 in position 0.
We look at our data set and say,
okay, what are the probabilities for the next
nibble from position 0
when the current value is 0x2?
And we have those probabilities
based on the data set.
We use that to create a weighted die
and we roll that die and that
gives us the value for the next position.
So now we have the value for position 1
and we do the same thing again.
Okay, when I have a value of 0xa in position 1,
out of everything we've seen before,
what's the probability distribution?
Do it again, do that 32 times
and then we have an IP address.
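The weighted-die walk can be sketched like this. It rebuilds the same kind of per-position transition counts so the example runs on its own; the function names and sample addresses are hypothetical, and `random.choices` stands in for the weighted die.

```python
import ipaddress
import random
from collections import Counter, defaultdict


def to_nibbles(addr):
    """Expand an IPv6 address into its 32 hex nibbles as integers."""
    exploded = ipaddress.IPv6Address(addr).exploded.replace(":", "")
    return [int(ch, 16) for ch in exploded]


def build_model(addresses):
    """Count, per position, which nibble followed which."""
    model = [defaultdict(Counter) for _ in range(31)]
    for addr in addresses:
        nibbles = to_nibbles(addr)
        for pos in range(31):
            model[pos][nibbles[pos]][nibbles[pos + 1]] += 1
    return model


def generate(model, rng, start=0x2):
    """Walk the model left to right, rolling a weighted die at each position."""
    if start not in model[0]:
        raise ValueError("start nibble never seen in position 0")
    nibbles = [start]
    for pos in range(31):
        counts = model[pos][nibbles[-1]]
        values = list(counts.keys())
        weights = list(counts.values())
        nibbles.append(rng.choices(values, weights=weights, k=1)[0])
    hex_string = "".join(f"{n:x}" for n in nibbles)
    return ipaddress.IPv6Address(int(hex_string, 16))


# Made-up sample input from the documentation prefix.
seeds = ["2001:db8::1", "2001:db8::2", "2001:db8:0:1::a"]
model = build_model(seeds)
rng = random.Random(0)  # seeded only so repeated runs match
candidates = [generate(model, rng) for _ in range(5)]
for address in candidates:
    print(address)
```

Because each step only looks at the previous nibble, the walk can produce addresses that were never in the input, which is the whole point: those are the candidates to scan.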
And so, another lesson
about research projects,
don't get your hopes up too quickly.
So we generated 10 million
addresses and then we
ping scanned all 10 million addresses
and 50,000 of them responded
saying, yup, I'm a real address.
We're really good at this.
Wow, that was
easier than I thought it would be.
I feel like that shouldn't be a thing.
I feel like that was too easy.
So it turns out
it was too easy because
then we learned about these guys.
So in IPv6
there's this phenomenon
which we still do not know why this happens
but there's these
alias network ranges
where basically every single IP address
in an IPv6
network range is mapped
to a single host.
So they're all going to respond
to a ping.
Now if you have
a loop, a feedback loop
where you generate addresses,
scan for them, and then take the ones that responded
and feed them back into your model
and you don't take
these into account, you very quickly
end up with a model that is completely
useless but is really good at
finding alias network ranges.
And that's what happened to us.
So we had to find a way to
detect and eliminate these
alias network ranges from our data sets.
So here's how we did that.
So let's say that this IP address
responded to a ping and we say, okay,
you know what, I don't trust you anymore.
I've learned my lesson. I'm going to think
that you are actually in an alias network range.
So I'm going to wrap you in a
/96, and then
I'm going to generate 8 random addresses
in that network. Again, the assumption
here is that you're
in an alias network range, so I'm going to test
the addresses around you to see if
they respond as well, and the chances
of us guessing 8 random addresses
out of 4.3 billion
and those actually being live hosts,
that feels pretty small. I'm not a statistician
either, but I think it's small.
So we ping scan these
guys, and if 50%
or more respond, then we know that
network range is aliased.
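A minimal sketch of that check follows. The `ping` argument is a hypothetical callable that returns True when an address answers an ICMPv6 echo; here it is faked with a lambda so the example runs without touching the network. The /96, 8 samples, and 50% threshold come from the talk; everything else is an assumption.

```python
import ipaddress
import random


def looks_aliased(addr, ping, rng=None, prefix_len=96, samples=8, threshold=0.5):
    """Probe random neighbours inside the enclosing prefix of `addr`.

    Returns True when at least `threshold` of the probes respond.
    """
    rng = rng or random.Random()
    network = ipaddress.IPv6Network(f"{addr}/{prefix_len}", strict=False)
    base = int(network.network_address)
    host_bits = 128 - prefix_len
    probes = [
        ipaddress.IPv6Address(base | rng.getrandbits(host_bits))
        for _ in range(samples)
    ]
    responders = sum(1 for probe in probes if ping(probe))
    return responders / samples >= threshold


# Fake network for illustration: everything in this /48 "responds".
aliased_range = ipaddress.IPv6Network("2001:db8:aaaa::/48")
fake_ping = lambda address: address in aliased_range

print(looks_aliased("2001:db8:aaaa::1", fake_ping))   # inside the fake range
print(looks_aliased("2001:db8:bbbb::1", fake_ping))   # outside it
```

In a real scanner the probes would be lost or rate limited some of the time, which is presumably why the threshold is 50% rather than requiring all 8 to answer.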
But that's just the /96.
We want to know where
the network boundary is for the
actual alias network range. So now we've got to find
that boundary.
So we know it's the /96,
which means
that the rightmost
32 bits are within the alias
network range, but the other 96 bits
we don't actually know anything about yet.
So what we do is we perform a binary search.
We take
the right half of the bits that we don't
know anything about yet, and we flip them.
Again, the logic being that if these
bits are within the alias network range,
then this IP address should respond to
a ping scan as well.
So we take that,
we ping scan it, and one of two things
happens. Either we
don't receive a response,
which means that the boundary of the
alias network range is within the bits that
we flipped, and the bits that we did not
test are definitely not in the network
range,
or a response was received,
which means that the bits that we tested are
within the alias network range, and now we have
to test the ones that we didn't.
So this is a binary search, we rinse and repeat,
and between five and six
iterations of this, you can actually
find the exact boundary
for where the alias network range
begins and ends.
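The boundary search can be sketched as a binary search over the prefix length. This is an illustration under stated assumptions, not the tool's code: `ping` is a hypothetical callable faked with a lambda, each step trusts a single probe, and the iteration count of this particular sketch is not tuned to match the figure quoted above.

```python
import ipaddress


def flip_bits(addr_int, start, end):
    """Flip bit positions [start, end), counting from the leftmost bit as 0."""
    width = end - start
    mask = ((1 << width) - 1) << (128 - end)
    return addr_int ^ mask


def find_alias_prefix_len(addr, ping, known_len=96):
    """Find the shortest prefix length whose whole range appears aliased.

    Assumes the /known_len around `addr` has already been found aliased.
    """
    base = int(ipaddress.IPv6Address(addr))
    lo, hi = 0, known_len
    while lo < hi:
        mid = (lo + hi) // 2
        # Flip the right-hand part of the still-unknown bits.
        probe = ipaddress.IPv6Address(flip_bits(base, mid, hi))
        if ping(probe):
            hi = mid        # flipped bits are inside the aliased range
        else:
            lo = mid + 1    # the boundary lies within the flipped bits
    return lo


# Fake network for illustration: everything in this /48 "responds".
aliased_range = ipaddress.IPv6Network("2001:db8:aaaa::/48")
fake_ping = lambda address: address in aliased_range

print(find_alias_prefix_len("2001:db8:aaaa::1", fake_ping))
```

A production version would likely send several probes per step, since one lost packet or one genuine live host would push the search in the wrong direction.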
And that was
the first release of our tool.
So basically,
we have this really stupid probabilistic
model, generate IP addresses,
scan for those IP addresses,
all the IP addresses that respond,
we say, okay, test for alias networks,
remove the ones that are in alias networks,
and then feed them back into the loop
and keep going.
And this worked reasonably well.
We basically
had, we would get one new
novel IPv6 address that we had
never seen before once every
15 seconds, which is
not great,
but better than guessing one out of
two to the power of 128
to us.
So this is actually the
fourth time that we've spoken on this
topic, and every time that we give a
talk, we like to make sure that we
add new functionality into our software.
And so one of the most recent releases
focused on getting less
dumber. Not easy for
us to do.
But we wanted to improve our discovery
rate. And so we were going to take
a two-pronged approach to that. I'll talk
about the first prong, Mark will talk about the
second. But first was, let's find a
better way to generate these addresses.
And we read this paper called 6Gen,
which comes out of UC Berkeley, published in
2017, and it's basically
a further iteration on the Entropy/IP
paper, coming up with some really interesting
ideas around how you can find
new IPv6 addresses.
And kind of the
thing that really kind
of hit home with us
was their notion of
an IPv6 address cluster.
And they defined
an IPv6 address cluster as
an IPv6 address
and a number of wildcard
indices. And I'll talk more
about what that means here in a second.
They evaluate how good a
cluster is based on two things.
One of them is the capacity.
How many IPv6 addresses
are possible in this cluster?
And then density. Of all
of those possible IPv6 addresses,
how many of them are in your input
data set?
So let me go back to the
question, what do these clusters actually look like?
So let's say that I wanted to create a cluster
from this IP address.
Say, okay, cool. Now I have
a cluster of size one that has no
wildcard indices, and it's fairly
uninteresting. It's just a single IP address
at this point. The interesting part
comes, though, when we say,
well, what if I upgraded
or grew this cluster
to include a second IP address?
Let's take
this second IP address, for example.
Now the process of growing the
cluster to include a second IP address
is that you compare all of the
nibbles, and you figure out
which nibbles you would have to turn into
wildcards in order for the cluster
to fit both of them.
So in this case, there's only
one nibble that is different between
the two IP addresses. So in order
to upgrade this cluster or grow
this cluster so that it includes both of these
IP addresses, we have to add
a wildcard index at that position.
So basically, in order for
an IP address to be contained within this data
set, or within this cluster,
all of the nibbles have to match, except for
the wildcard indices, because those match
any of them.
In this case, this cluster has a capacity
of 16, because there's 16 possible values
for that one position.
Our input data set was
of size 2, and both of them were found
in the cluster. So in this
case, the density of this cluster is
12.5%.
And really,
to hit this home, these are all
the possible addresses that are in that cluster.
You'll notice that every nibble
holds the same value,
except in that single position, which is where
the wildcard was.
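A minimal Python sketch of the cluster idea described above. The example addresses come from the 2001:db8::/32 documentation range and stand in for the ones on the slide; the function names are illustrative and are not taken from the paper or from the ipv666 code. It assumes the input data set contains no duplicate addresses.

```python
import ipaddress

def nibbles(addr):
    """Return the 32 hex nibbles of an IPv6 address as one string."""
    return ipaddress.IPv6Address(addr).exploded.replace(":", "")

def grow(base, wildcards, new_addr):
    """Return the wildcard indices needed for the cluster to also fit new_addr."""
    a, b = nibbles(base), nibbles(new_addr)
    return wildcards | {i for i in range(32) if a[i] != b[i]}

def contains(base, wildcards, addr):
    """An address is in the cluster if every non-wildcard nibble matches."""
    a, b = nibbles(base), nibbles(addr)
    return all(i in wildcards or a[i] == b[i] for i in range(32))

def capacity(wildcards):
    """Each wildcard position can take any of 16 values."""
    return 16 ** len(wildcards)

def density(base, wildcards, dataset):
    """Fraction of the cluster's possible addresses that are in the data set."""
    hits = sum(contains(base, wildcards, a) for a in dataset)
    return hits / capacity(wildcards)

dataset = ["2001:db8::1", "2001:db8::5"]       # the two differ only in the last nibble
wild = grow(dataset[0], frozenset(), dataset[1])
print(sorted(wild), capacity(wild), density(dataset[0], wild, dataset))
# prints: [31] 16 0.125
```

The single wildcard at index 31 gives a capacity of 16, and two known addresses out of 16 give the 12.5% density mentioned above.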
So the original algorithm,
I'm not going to go into depth on it, but
I will say this. If you're ever reading a
research paper, and
one section says algorithm,
and then the next section says
optimizations for this algorithm so that
you can get it to run on your machine,
it's probably an expensive algorithm.
And it's probably not going to play too well
in software that you want to be running in a constant
loop. I will
say, though, that the algorithm that
we came up with is a direct derivation
of their work, and their
work is very impressive. I highly
recommend reading their paper.
So here's
how we use these clusters to generate
IP addresses. So first we want
to build a really good cluster set.
So we take every IP address
in our input data set, and we
create a cluster of size 1
for every one of them. So now
that we have all these clusters, the next
step is I want to figure out
what the best possible
wildcard index to add
to each cluster is. So I'll take
one cluster and I say, okay, well what if I
turned index 0 into a wildcard?
How good would that cluster be?
What about index 1? What about index 2?
What about index 3?
And in doing this, I figure
out what the best position to add a
wildcard at is, and that's
the best possible upgrade for that cluster.
Now in some cases,
it doesn't matter which index
you pick, they're all bad choices,
which means effectively that
this IP address is kind of out there in the ether
and has no adjacent
IP addresses, in which case we don't really
want to build a clustering model around it.
We just kind of set it off to the side
and we reuse it later.
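The "try every index" step can be sketched as follows. This is a rough illustration, not the ipv666 implementation: the rule that an upgrade is a bad choice when it covers no additional address from the input data set is an assumption, and the function names and sample addresses are made up for the example.

```python
import ipaddress

def nibbles(addr):
    """Return the 32 hex nibbles of an IPv6 address as one string."""
    return ipaddress.IPv6Address(addr).exploded.replace(":", "")

def covered(base, wildcards, dataset):
    """Count data set entries (32-nibble strings) that fit the cluster."""
    return sum(
        all(i in wildcards or base[i] == other[i] for i in range(32))
        for other in dataset
    )

def best_upgrade(base, wildcards, dataset):
    """Return (density, index) for the densest single-wildcard upgrade, or None."""
    current = covered(base, wildcards, dataset)
    best = None
    for idx in range(32):
        if idx in wildcards:
            continue
        trial = wildcards | {idx}
        hits = covered(base, trial, dataset)
        if hits <= current:
            continue  # assumption: an upgrade that adds no address is a bad choice
        dens = hits / 16 ** len(trial)
        if best is None or dens > best[0]:
            best = (dens, idx)
    return best

data = [nibbles(a) for a in [
    "2001:db8::1", "2001:db8::2", "2001:db8::3",
    "2001:db8:0:1::1", "2a00:1450::abcd",
]]
print(best_upgrade(data[0], frozenset(), data))  # prints: (0.1875, 31)
print(best_upgrade(data[4], frozenset(), data))  # prints: None
```

The last address has no neighbor that differs by a single nibble, so every index is a bad choice and it would be set aside.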
So now that we have all of these clusters
and we have all of these upgrades,
we sort the upgrades by density
because we want to pick the densest upgrades first,
and then we take the
best upgrade off the top, put it
into our cluster set, and then
recalculate what the next best upgrade
for that new cluster might be, and put it
back in the list of upgrades.
So we take the best upgrade off the top,
recalculate what its upgrade would be, put it back in the
upgrade candidates, and rinse and repeat.
And we keep doing this
based on a score.
So we have a mechanism for scoring
the overall utility
of the cluster set that we're building,
and it's based on capacity and density.
And what you'll find is that
when you first start doing this, the score
goes up, up, up, up, up, and then it
plateaus, and then it starts coming down.
And we want to find the model
at the peak of this curve.
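A sketch of the densest-upgrade-first loop using a priority queue. The scoring formula is not given in the talk, so this example substitutes a fixed step budget (`max_steps`, a hypothetical parameter) for the score-based stopping rule. It also does not merge clusters that end up identical. All names are illustrative.

```python
import heapq
import ipaddress

def nibbles(addr):
    return ipaddress.IPv6Address(addr).exploded.replace(":", "")

def covered(base, wildcards, dataset):
    return sum(
        all(i in wildcards or base[i] == other[i] for i in range(32))
        for other in dataset
    )

def best_upgrade(base, wildcards, dataset):
    """Densest single-wildcard upgrade that covers a new address, or None."""
    current, best = covered(base, wildcards, dataset), None
    for idx in range(32):
        if idx in wildcards:
            continue
        hits = covered(base, wildcards | {idx}, dataset)
        dens = hits / 16 ** (len(wildcards) + 1)
        if hits > current and (best is None or dens > best[0]):
            best = (dens, idx)
    return best

def grow_clusters(dataset, max_steps):
    """Start with one size-1 cluster per address, then apply upgrades greedily."""
    clusters = [(base, frozenset()) for base in dataset]
    heap = []  # entries are (-density, cluster id, index); heapq pops the smallest
    for cid, (base, wild) in enumerate(clusters):
        up = best_upgrade(base, wild, dataset)
        if up:
            heapq.heappush(heap, (-up[0], cid, up[1]))
    for _ in range(max_steps):        # stand-in for the score-based stopping rule
        if not heap:
            break
        _, cid, idx = heapq.heappop(heap)          # densest upgrade off the top
        base, wild = clusters[cid]
        clusters[cid] = (base, wild | {idx})
        up = best_upgrade(base, wild | {idx}, dataset)  # recalculate, put it back
        if up:
            heapq.heappush(heap, (-up[0], cid, up[1]))
    return clusters

data = [nibbles(a) for a in
        ["2001:db8::1", "2001:db8::2", "2001:db8::3", "2001:db8:0:1::1"]]
print([sorted(w) for _, w in grow_clusters(data, 1)])
# prints: [[31], [], [], []]
```

In a fuller version, the score would be evaluated after each step and the cluster set at the peak would be kept.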
So once we have that,
we want to use that to generate
new IP addresses.
And to do that, we basically say, okay,
we pick a cluster at random from the set
of clusters that are in our model,
and then for every one of the 32
nibbles that we need to generate, we flip a
coin, and if the coin lands
on heads, we generate from the cluster.
If the coin lands on tails, we generate
from a probability distribution
of all the addresses we didn't
put in the clustering model.
So if we're generating from the
cluster set, we basically say, okay,
is this index a wildcard?
If so, just pick a random nibble.
If not, just take the value
at that nibble. Or,
if we're getting it from the probability distribution,
we say, okay, for this nibble position,
what is the probability distribution
across all addresses that aren't in the cluster set?
Create a weighted die, roll it,
and that gives us the value. We do that
32 times, that gives us an IP address,
and this actually works pretty well.
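The per-nibble generation step can be sketched like this. It assumes a fair coin (`p_cluster=0.5` is a hypothetical parameter), a non-empty list of left-over addresses, and addresses held as 32-nibble hex strings. The names are illustrative and not taken from the ipv666 code.

```python
import random

HEX = "0123456789abcdef"

def generate(clusters, leftovers, rng, p_cluster=0.5):
    """Generate one candidate address as eight colon-separated groups.

    clusters:  list of (base, wildcards) pairs; base is a 32-nibble hex string
    leftovers: 32-nibble strings for the addresses left out of the clusters
    p_cluster: chance of "heads" per nibble (0.5 is an assumed fair coin)
    """
    base, wildcards = rng.choice(clusters)      # pick one cluster at random
    out = []
    for i in range(32):
        if rng.random() < p_cluster:
            # Heads: wildcard positions are random, fixed positions are copied.
            out.append(rng.choice(HEX) if i in wildcards else base[i])
        else:
            # Tails: drawing uniformly from the values seen at this position
            # is a die weighted by how often each value occurs.
            out.append(rng.choice([addr[i] for addr in leftovers]))
    nib = "".join(out)
    return ":".join(nib[j:j + 4] for j in range(0, 32, 4))

rng = random.Random(666)
clusters = [("20010db8" + "0" * 23 + "1", {31})]
leftovers = ["20010db8" + "0" * 20 + "abcd", "20010db8" + "0" * 20 + "abce"]
print(generate(clusters, leftovers, rng))
```

Passing in a seeded `random.Random` keeps runs repeatable, which helps when comparing models.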
And so the second part of
what we wanted to do to kind of improve
our identification
or detection accuracy is
fan out from initial discovery points,
which is what Mark's going to talk about next.
So as Chris explained,
we have the 666 gen algorithm,
and we use this to generate our
potential candidate IPv6 addresses.
We ICMP ping scan them,
and we have a set of discovered addresses.
From that initial set of discovered
addresses, we then want to fan out
and use those as starting points and assume
that we're going to have either neighboring networks,
neighboring hosts, or similar addresses
in structure.
And so for the similar addresses in structure,
we do something which we call a nibble-adjacent
fan out. And the premise here is that
because there is inherent structure in IPv6
addresses, if we've discovered one address,
we think it's likely that we're going
to find other addresses by varying
at most one nibble from that address.
And so in this example, we have
a target network of 2000 slash 4,
and the slash 4 tells us that 4 bits
are going to be the network mask,
which means the first nibble, the 2,
is going to be static, and we can vary
the other 31 nibbles in the address.
So we start in this case with the last
nibble in this address. It starts at a
value of 1, so we generate one
new candidate address by setting that last
nibble to 0. We generate another candidate
address by setting that to 2, 3,
and so forth, and we've now generated
15 new candidate addresses by
changing that one last nibble.
In this case, because our target network is a
slash 4, we have 31 nibbles that we can
vary, so we go all the way down and we
do the same procedure across all of the
31 nibbles. For each nibble,
we generate 15 new candidate addresses,
and so this gives us between 15
and 465 new candidate addresses
for every newly discovered address.
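A short Python sketch of the nibble-adjacent fan out. The function name and the documentation-range address are illustrative. It assumes that a nibble only partly covered by the network mask is held fixed.

```python
import ipaddress

HEX = "0123456789abcdef"

def nibble_fanout(addr, prefix_len):
    """Candidates that differ from addr in exactly one nibble outside the prefix."""
    nib = ipaddress.IPv6Address(addr).exploded.replace(":", "")
    first_free = -(-prefix_len // 4)  # ceiling division: nibbles in the mask stay fixed
    out = []
    for i in range(31, first_free - 1, -1):   # start with the last nibble
        for h in HEX:
            if h != nib[i]:
                candidate = nib[:i] + h + nib[i + 1:]
                out.append(str(ipaddress.IPv6Address(int(candidate, 16))))
    return out

candidates = nibble_fanout("2001:db8::1", 4)
print(len(candidates), candidates[:3])
# prints: 465 ['2001:db8::', '2001:db8::2', '2001:db8::3']
```

With a slash 4 target, 31 nibbles vary and each yields 15 candidates, giving 465. With a slash 124 target only the last nibble varies, giving 15.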
And one of the reasons we do this is scanning
is cheap, generation is hard,
and by using just these small derivations
from previously discovered addresses,
we have what we perceive to be a high
likelihood of finding new addresses.
And then we do a similar technique looking at
sequentially increasing and decreasing
network addresses. So again,
if we find an address which is
a colon colon 1 at a slash 64,
this means the first address at a slash
64 network, we're going to assume
that this may be a piece of networking equipment,
maybe some infrastructure gear, or customer premises
equipment, and if we were an ISP
or a networking provider, we might
assign monotonically
increasing and decreasing network
addresses. So we take this address,
we take the leftmost 64 bits,
and we increase that and we decrease
that to look for potentially neighboring
addresses that are still at that same colon colon
1, just different slash 64s.
And then we do a similar technique
looking for neighboring hosts. So if we have
this colon colon 1 slash 64,
we're going to assume that is going to be
a router which may be doing monotonically
increasing DHCP leases,
which means we might expect to see devices
at colon colon 2, colon colon 3,
and so forth. So we generate a bunch
of these neighboring slash 64 networks,
slash 64 hosts, as well as
the nibble-adjacent fanout candidates.
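A sketch of the neighboring-host step, assuming hosts are numbered upward from the discovered address within the same slash 64. The `count` parameter and the function name are hypothetical.

```python
import ipaddress

def neighbor_hosts(addr, count):
    """Return up to count addresses directly above addr within its /64."""
    start = int(ipaddress.IPv6Address(addr))
    last_in_net = start | ((1 << 64) - 1)     # highest address in the same /64
    stop = min(start + count, last_in_net)
    return [str(ipaddress.IPv6Address(v)) for v in range(start + 1, stop + 1)]

print(neighbor_hosts("2001:db8:0:5::1", 3))
# prints: ['2001:db8:0:5::2', '2001:db8:0:5::3', '2001:db8:0:5::4']
```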
And what we discovered is that
by combining this fanning out
with the 666 gen algorithm,
we're actually able to discover far
more addresses than we could with our initial
implementation. So here
we have in v0.2,
we were able to discover around 60,000
addresses over the course of 8 days,
and 80% of those were novel
addresses not in our public datasets.
And this wasn't great, but it was a good
start, and it showed that we could actually find
these new IPv6 addresses and their
connected hosts. Now with version 0.3
where we added 666 gen
and the fanning out, this was a remarkable
improvement. And we're able to discover
in this case 1.57 million
new addresses
in the course of one hour, and this
approached 10 or more million in 24 hours,
and 78% of these were not
in publicly available datasets.
And so this is a 503,000%
improvement, and this is validation that
all of this hard work led us to
a set of algorithms with which we can actually
discover a huge number of new IPv6
addresses. Yeah, this is
less a testament to how good we are
now and more to how bad we were before.
Yes. And so we
took a sample of 100,000 addresses
from this dataset, and we decided to
do a TCP connect scan across a number of
common TCP ports. And our goal here was
to sanity check our work and make sure that we
weren't fooling ourselves with
false positives. And in this test
we had quite a few devices with
open TCP ports. We found lots
of networking equipment, both infrastructure
and customer premises equipment.
We found completely unauthenticated
MongoDB instances, lots of ancient
SSH and Telnet services, and this
was the first validation we had seen of
our initial hypothesis, which was that
there were likely lots of IPv6 connected
hosts on the internet that were potentially
misconfigured or not intended to be exposed.
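A TCP connect check of this kind can be sketched with the Python standard library. The port list in the usage comment is an assumption, since the talk does not say which ports were scanned, and the function name is illustrative. `socket.create_connection` accepts IPv6 literals as well as IPv4 ones.

```python
import socket

def tcp_connect_scan(host, ports, timeout=1.0):
    """Return the ports on host that accept a full TCP connection."""
    open_ports = []
    for port in ports:
        try:
            # A completed three-way handshake means the port is open.
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass  # refused, timed out, or unreachable
    return open_ports

# Usage (hypothetical port list; 2001:db8::1 is a documentation address):
#   tcp_connect_scan("2001:db8::1", [22, 23, 80, 443, 27017])
```

A completed connection is strong evidence that a real host answered, which is why it works as a false-positive check on ping results.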
And now we're a year and a half into this
project, and we finally have the
data and tools to start to validate
our initial hypotheses. And so
a year and a half later we can actually
start the research that we wanted to start
way back then.
So now Chris is going to take you to the cloud.
Right up to the cloud.
Alright, so
as I said before, we don't like giving the same
talk twice.
And so here's what we've done for
this release of our software.
So last
release, we were like, oh my gosh,
our discovery rate is so much
better, this is really cool. And then we
found that, you know, you can actually run two
instances of this software side by side
and they would find mostly unique addresses.
Which means that this is pretty horizontally
scalable. So we started thinking,
well it would be really cool to aggregate
all of this data and provide this data back
to the community. So what we've done
is we added a little bit of functionality
into the tool where you can opt in and
say, I'm okay sharing the data that I'm collecting.
And it uploads it to our central server
and then all of that data is queryable
on our website, which is
ipv6.exposed.
And so, you know, maybe
you want to run the tool, maybe you want to send some
data to us, maybe you want to provide some data, or maybe
you just want to query all the data that we
have aggregated thus far.
It's all cool by us, but we hope to
enable other folks to start
taking a look at IPv6
security posture by doing this.
We did not have time to
put the link into this slide,
but I will make sure to have it posted on our
GitHub later.
And so our tool set is known as
ipv666.
The 0.4 release
adds opt-in pushing of scan results to the cloud.
If you have
Go 1.11 or higher
installed, just type in go get
and then this URL. I'll leave this
up here in a minute when we get
to Q&A. But there's a number
of things that it can do.
We can do the address discovery.
We can test to see if
a network is aliased. We can generate
just random addresses.
We can generate a new model if you have addresses
you want to create one from.
We can generate a blacklist so that you don't
scan a particular address range.
We can clean a list based on the contents of the
blacklist that shipped. Or you can convert
files containing IPv6 addresses to
various representations.
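To make the last two items concrete, here is a small Python sketch of cleaning an address list against a blacklist and converting an address between representations. It uses only the standard `ipaddress` module and does not reflect the tool's actual file formats or command-line interface. The function names and the chosen representations are assumptions.

```python
import ipaddress

def clean(addresses, blacklist):
    """Drop every address that falls inside a blacklisted network."""
    nets = [ipaddress.IPv6Network(n) for n in blacklist]
    return [
        a for a in addresses
        if not any(ipaddress.IPv6Address(a) in net for net in nets)
    ]

def representations(addr):
    """Return a few common textual forms of the same address."""
    ip = ipaddress.IPv6Address(addr)
    return {
        "compressed": ip.compressed,
        "exploded": ip.exploded,
        "hex": format(int(ip), "032x"),
    }

print(clean(["2001:db8::1", "2001:db8:1::1"], ["2001:db8::/48"]))
# prints: ['2001:db8:1::1']
print(representations("2001:db8::1")["hex"])
# prints: 20010db8000000000000000000000001
```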
And again, we'll leave the links up here
in a second, but I'll let
Mark finish.
So we started by
talking about how we got into this research and
how it was an offshoot of our Comcast vulnerability
research that we presented at DEF CON 25.
We looked at the difficulty of scanning
this IPv6 address space due to this
128-bit search space being too large.
We talked about the failures
we had with honeypotting, the
marginal results we had with the initial
probabilistic modeling, and the much
better results we had by introducing 666 gen
and the fanning out.
We then looked at the results from our scanning
and some initial port scans. Chris talked about
pushing our results to the cloud so that we can
generate this global crowdsourced
dataset of IPv6 addresses.
We touched on the IPv666
toolset we've released, which we encourage you to use.
And I just want to say that
we've spent a year and a half on this.
We've done a lot of learning. We've made a lot of mistakes.
But this is how research projects go.
And we encourage you to work on your own research.
We hope this can contribute to that.
And we welcome contributions to the codebase.
And thank you very much.
Here we have a number of links that
inspired us. We recommend taking a look at these
if you are interested in this project.
I'll put the links back up.
Yes, thank you all.
Thank you.
